Abstract

The aim of this paper is an experimental study of cache systems in order to optimize proxy cache systems and to modernize their construction principles. Our investigations lead to criteria for the optimal use of storage capacity and allow a description of basic effects such as the ratio between construction parts, steady-state performance, optimal size, etc. We want to emphasize that the results obtained and the plan of the experiment follow from the theoretical model. Special consideration is given to the modification of the key formulas proposed by Wolman et al. [11].
1 Introduction



This paper considers approaches to the problem of optimizing proxy cache construction on the basis of theoretical and experimental study. Until recently, caching was an optional service for users who voluntarily configured their browsers to redirect requests through a proxy. Today, Internet Service Providers interpose caching systems at strategic points on organizational boundaries.
Experimental research on cache systems and the construction of mathematical models should lead to growth in caching effectiveness with minimal financial expenditure. A better algorithm that increases hit ratios by only several percent would be equivalent to a multiple growth in cache size. In order to find the optimal cache size, Kelly and Reeves [8] rely on economic methods such as the monetary cost of memory and bandwidth.



Our analysis is based on the model presented by Breslau et al. [2] and extended 
by Wolman et al. [11] to incorporate the steady-state behavior and documents' 
rate of change. One difficulty in the study of Web caching is that there are 
many cache replacement policies and many factors affecting their performance. 
Wolman et al. parameterized the model using population size, population re- 
quest rate, document rate of change, size of object universe, and popularity 
distribution of objects. Several researchers [7] include additional factors like 
object size, miss penalty, temporal locality, and long-term access frequency. 
Our research allows the calculation of the lower limit of the cache size, corresponding to the aggregated bandwidth of external links, at which performance reaches the effective level of 35%.



Such an approach can easily be generalized to describe any application based on a Zipf-like distribution, such as Content Distribution Networks (CDN) [6], peer-to-peer systems [9,10], Internet search engines, etc.



The paper investigates how the different parameters of a proxy cache influence its performance. The measurement methods, including the treatment of experimental data and the analytical formulas used for calculations, are discussed. The correlations between the significant parameters are investigated and the corresponding figures and tables are constructed. On the basis of the new experimental data the analytical model is updated, and an addition to the cache construction and to the replacement algorithm is proposed.



Special consideration is given to the modification of the key formulas introduced by Wolman et al. [11]. An alternative model is developed to describe the renewal effect of Web documents in the global network. The document rate of change is assumed to depend on the popularity index $i$, so a Zipf-like distribution with a new exponent $\alpha_R$ describes the effects covered by the Wolman et al. model. In a dedicated experiment, the values of the exponents $\alpha$ and $\alpha_R$ and of the document rates of change $\mu_p$ and $\mu_u$ were calculated.
Fig. 1. Scheme of the proxy cache



2 Plan of the measurements 



The scheme of Web caching can be presented in the following way: users send requests to the global network and receive information through a cache system, as shown in Fig. 1. Some documents are requested repeatedly and therefore should be held in the cache system.

The relative frequency [2] of requests to Web pages follows a Zipf-like distribution [12]. This distribution states that the relative probability of a request for the $i$-th most popular page is

$$\theta_i = \frac{A}{i^{\alpha}}, \qquad (1)$$

where $A = \theta_1$ is the probability of the most popular item and $\alpha$ is a positive exponent less than unity.
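The Zipf-like law of Eq. (1) can be illustrated with a minimal Python sketch. It assumes the reading $\theta_i = A / i^{\alpha}$ with $A$ fixed by normalization over a finite universe of $n$ documents; the function name `zipf_probabilities` and the values `n = 1000`, `alpha = 0.77` are chosen here only for illustration.

```python
def zipf_probabilities(n, alpha):
    """Return [theta_1, ..., theta_n] with theta_i = A / i**alpha.

    A is chosen so that the probabilities sum to one (an assumption
    made for this sketch; the universe size n is hypothetical).
    """
    weights = [1.0 / i ** alpha for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative values only: 1000 documents, exponent 0.77.
probs = zipf_probabilities(1000, 0.77)
A = probs[0]  # probability of the most popular item
```

With these definitions the probability of the tenth most popular document is `A / 10 ** 0.77`, so popularity falls off as a power of the rank.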

We proposed the following test: the size of the proxy cache $S_{eff}$ was varied. The chosen values correspond to incoming traffic for one, two, three, and six days. All types of documents, both cacheable and uncacheable, are taken into account when calculating the ratio $S_{eff}/\nu_{int}$.

The network of Samara State Aerospace University (SSAU) was chosen as the experimental field. The proxy cache of SSAU is a two-processor Linux server with the SQUID proxy installed. All hierarchical links were disconnected before the experiment began. For each point, request statistics were collected over a long time $T_{st}$ of more than a month.

In order to modify the model proposed by Wolman et al., we collected the requests during a time $T_{st} < t_u$. Here $t_u$ is the mean lifetime of those documents whose popularity is $\vartheta_i = 1$ ($\vartheta_i = \theta_i k$).
Table 1

Primary results
3 The original results and their processing



The primary results of the experiments are summarized in Table 1, where the following notation and abbreviations are used:

• The variables $\nu_{int}$ and $\nu_{out}$ are the incoming and outgoing request streams of the proxy cache, as shown in Fig. 1. They are measured as the number of user requests per day (Rpd, requests per day). Note that the variable $\nu_{out} = \lambda N$ describes the request stream from a collective user, which is a significant parameter of the Wolman et al. model.

• $H$ is the performance of the cache system or, in other words, the hit ratio.

• The same variables with the upper index $B$ describe the system in units of transmitted traffic (Kbps, kilobits per second).

• $E(S)$ is the mean size of documents received directly from the global network.

• $E(C)$ is the mean size of documents served from the cache.

• Finally, the time $T_{st}$ is the number of days over which statistics were collected.

The collected statistics were processed by scripts written specially for the task. First, the so-called cacheable documents, which can be stored in a proxy, were selected from the general list. Then the corresponding Zipf-like distribution was constructed, with the documents placed in order of decreasing popularity index $\vartheta_i$. A fragment of this list is shown below:

112 http://www.ixbt.com/imagcs/cmpty.gif (line 457), i.e. $\vartheta_{457} = 112$

111 http://cachcscrvcr.myccom.net/main/imagcs/adlogo.jpg (line 458) 

2 http://zzz.net.ru/imagcs/spacer.gif (line 78166=M) 

1 http://www.muz-tv.ru/chat/chat-top.html (line 200045) 

The number of unique cacheable documents, i.e. the number of lines in the list mentioned above, is $p$. The line number of the last document that was requested twice ($\vartheta_M = 2$) is $M$.
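The processing step described above can be sketched in Python as follows. This is not the authors' script; it assumes that the URLs of cacheable requests have already been extracted from the proxy log into a list, and the names `popularity_list` and `summarize` as well as the sample URLs are hypothetical.

```python
from collections import Counter


def popularity_list(urls):
    """Return (count, url) pairs ordered by decreasing request count."""
    counts = Counter(urls)
    pairs = ((count, url) for url, count in counts.items())
    # Ties are broken alphabetically so that the ordering is reproducible.
    return sorted(pairs, key=lambda cu: (-cu[0], cu[1]))


def summarize(ranked):
    """Return (p, k, M) for a ranked popularity list.

    p: number of unique documents (lines in the list)
    k: total number of requests for these documents
    M: number of documents requested at least twice, i.e. the line
       number of the last document with a count of two or more
    """
    p = len(ranked)
    k = sum(count for count, _ in ranked)
    M = sum(1 for count, _ in ranked if count >= 2)
    return p, k, M


# Hypothetical request stream, for illustration only.
sample = ["http://a.example/x.gif"] * 3 + ["http://b.example/y.jpg"] * 2 + ["http://c.example/z.html"]
ranked = popularity_list(sample)
p, k, M = summarize(ranked)
```

For the sample stream the ranked list has three lines, six requests in total, and two documents that were requested more than once.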
Table 2



Parameters of proxy cache 
In order to find the total number of requested cacheable documents $k$, the following sum was calculated

$$k = \sum_{i=1}^{p} \vartheta_i \qquad (2)$$

where $\vartheta_i$ is the number of cache requests for the $i$-th most popular document. The portion of cacheable documents $p_c$ is defined by

$$k = p_c \nu_{out} T_{st}, \qquad (3)$$

where $K = \nu_{out} T_{st}$ is the total number of all documents received from the global network, including both cacheable and uncacheable ones.

The value of $\alpha$ was calculated using the equation

$$\alpha = 1 - \frac{2M}{\sum_{i=1}^{M} \vartheta_i} = 1 - \frac{2M}{k - p + M} \qquad (4)$$
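A short Python sketch of Eq. (4) is given below, assuming the per-document request counts $\vartheta_i$ are available as a list. The function name `zipf_exponent` and the sample counts are hypothetical; the sketch relies on the identity $\sum_{i=1}^{M} \vartheta_i = k - p + M$, which holds because the $p - M$ documents after line $M$ are each requested once.

```python
def zipf_exponent(counts):
    """Estimate alpha from request counts via alpha = 1 - 2M / (k - p + M)."""
    ranked = sorted(counts, reverse=True)
    p = len(ranked)                          # unique documents
    k = sum(ranked)                          # total requests
    M = sum(1 for c in ranked if c >= 2)     # documents requested at least twice
    if M == 0:
        raise ValueError("no document was requested more than once")
    return 1.0 - 2.0 * M / (k - p + M)


# Hypothetical counts, for illustration only.
alpha = zipf_exponent([5, 3, 2, 2, 1, 1, 1])
```

In this toy sample $p = 7$, $k = 15$ and $M = 4$, so the first four documents account for $k - p + M = 12$ requests.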

Finally, by analyzing the log files, the mean lifetime $t_u$ was calculated for those documents with a popularity of $\vartheta_i = 1$. Such statistics also determine the lifetime $T_{eff}$ of cache objects with the citing index $\vartheta_i = 2$, i.e. those items that have been stored in the proxy cache, requested once from the proxy, and subsequently deleted (see Tab. 2).



4 Basic correlation 

The first family of curves to be analyzed is the dependence of $t_u$ and $T_{eff}$ on the cache size $S_{eff}/\nu_{int}$. As was found in Ref. [5], these curves define the ratios between the elements of the cache construction:

• A kernel $S_k$ that contains popular documents with $\vartheta_i \ge 2$.

• An accessory part $S_u$ that keeps unpopular documents requested from the Internet once, i.e. $\vartheta_i = 1$.



Fig. 2. Dependence between lifetimes and cache size (lifetimes in days)

• A managing part $S_m$ that contains the statistics of requests and the rules for replacement of cache objects.

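Earlier [4,5] we obtained the following equations:

$$S_k = \frac{(1-\alpha)\, H\, \nu_{out} T_{eff}}{2} \qquad (5)$$

$$\frac{S_k}{S_u} = \frac{T_{eff}}{(2^{1/\alpha} - 1)\, t_u} = \frac{M}{p - M} \, \frac{T_{eff}}{t_u} \qquad (6)$$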
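As a rough illustration of Eq. (6), read here as $S_k/S_u = T_{eff} / ((2^{1/\alpha} - 1)\, t_u)$, the following Python sketch evaluates the ratio between the kernel and the accessory part. The function names are hypothetical, and the share computed by `kernel_share` ignores the managing part $S_m$, which is an assumption of this sketch.

```python
def kernel_to_accessory_ratio(alpha, t_eff, t_u):
    """S_k / S_u for a Zipf exponent alpha and lifetimes t_eff, t_u."""
    return t_eff / ((2.0 ** (1.0 / alpha) - 1.0) * t_u)


def kernel_share(alpha, t_eff, t_u):
    """Fraction of S_k + S_u occupied by the kernel (S_m is ignored)."""
    ratio = kernel_to_accessory_ratio(alpha, t_eff, t_u)
    return ratio / (1.0 + ratio)


# Equal lifetimes and an exponent of 0.77, the value fitted later in the text.
ratio = kernel_to_accessory_ratio(0.77, 1.0, 1.0)
share = kernel_share(0.77, 1.0, 1.0)
```

The second form of Eq. (6) follows from the first because the Zipf-like law with $\vartheta_M = 2$ and $\vartheta_p = 1$ gives $p/M = 2^{1/\alpha}$.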



Analysis of the experimental data shown in Tab. 2 and in Fig. 2 shows that the variables $t_u$ and $T_{eff}$ are directly proportional to the cache size $S_{eff}/\nu_{int}$ and can be considered coincident:

$$T_{eff} \approx t_u \qquad (7)$$

The only deviation, which reflects the effectiveness of the replacement algorithm, appears at small cache sizes, when performance is far from the optimal value.

In other words, the kernel and accessory parts are related approximately as 1:2, i.e. less than 40% of the storage capacity is used for the basic goal of storing repeatedly requested documents. It is notable that the number of documents $M$ that must be stored in the cache system for one and two months of traffic is less than the three-day and six-day incoming stream, respectively.

The second family of curves to be studied is the dependence of the system performance $H$ and $H^B$ on its relative size $S_{eff}/\nu_{int}$, see Fig. 3.

The hit rate $H$ of a Web cache is considered to grow in a log-like fashion as a function of cache size [1,2,3]. From the expression for cache performance
Fig. 3. Dependence between hit ratios and cache size
the following dependence appears

$$\frac{H_1}{H_2} = \left( \frac{S_1}{S_2} \right)^{1-\alpha}, \qquad (9)$$

which allows us to speak of a power-law dependence.
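The power-law dependence of Eq. (9), $H_1/H_2 = (S_1/S_2)^{1-\alpha}$, can be explored with a small Python sketch. The function names and the numbers used below are illustrative assumptions, not measured values.

```python
def scaled_hit_ratio(h1, s1, s2, alpha):
    """Hit ratio expected at cache size s2, given hit ratio h1 at size s1."""
    return h1 * (s2 / s1) ** (1.0 - alpha)


def size_factor_for_gain(gain, alpha):
    """Factor by which the cache must grow to multiply the hit ratio by `gain`."""
    return gain ** (1.0 / (1.0 - alpha))


# Illustrative numbers: a 10% relative gain in hit ratio at alpha = 0.77.
factor = size_factor_for_gain(1.10, 0.77)
check = scaled_hit_ratio(0.30, 1.0, factor, 0.77)
```

Because the exponent $1 - \alpha$ is small, even a modest relative gain in hit ratio corresponds to a much larger relative growth in cache size, which is the trade-off mentioned in the Introduction.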

Fig. 3 illustrates that the dependence between $H$ and $S_{eff}/\nu_{int}$ is described well by Eq. (9) with $\alpha = 0.77$, whereas the curve for $H^B$ is not predictable. This effect needs additional study, especially because Tab. 1 shows a positive difference between the mean size of cacheable documents $E(S)$ and the mean size of all items $E(C)$.



5 Renewal of Web documents 



The rapid development of computer technology at the end of the last millennium led to the appearance of a virtual world with its own laws. Unfortunately, during this period little attention was given to studying the fundamental principles of virtual life.

Nowadays we can afford some respite, and some resources should be transferred to studying the fundamental laws and to optimizing the vital systems on the basis of recent knowledge. Frankly speaking, a research area may be considered a scientific field if and only if its basic principles are expressed in mathematical form and new rules can be predicted on the basis of confirmed facts.

Our analysis is based on the model presented by Breslau et al. [2] and extended by Wolman et al. [11] to incorporate the document rate of change. The Wolman model yields formulas to predict steady-state properties of Web caching systems parameterized by population size $N$, population request rate $\lambda$, document rate of change $\mu$, size of the object universe $n$, and a popularity distribution for objects.

Fig. 4. The renewal effect



The key principles of their theoretical study have been developed in our work [5] to describe, in an alternative way, basic effects such as the ratio between construction parts, steady-state performance, optimal size, etc. We want to emphasize that the results obtained and the plan of the experiment follow from the theoretical model [5].



The key formulas from Wolman et al.

$$C_N = \int_1^n \frac{1}{C x^{\alpha}} \left( \frac{1}{1 + \frac{\mu C x^{\alpha}}{\lambda N}} \right) dx \qquad (10)$$

$$C = \int_1^n \frac{1}{x^{\alpha}}\, dx \qquad (11)$$

yield $C_N$, the aggregate object hit ratio considering only cacheable objects. The document rate of change $\mu$ was considered to take two different values, $\mu_p$ for popular documents and $\mu_u$ for unpopular documents. The additional multiplier in Eq. (10) discards one effective query to any cached document during the time $T_{ch} = 1/\mu$ between its changes. Such an updating request must be redirected to the global network to fetch the renewed Web page.
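Eqs. (10) and (11) can be evaluated numerically. The Python sketch below uses composite Simpson quadrature after the substitution $x = e^u$, which is a choice made here for convenience and not part of the Wolman et al. model; all function names and parameter values are hypothetical, and a single rate $\mu$ is used for all documents.

```python
import math


def simpson(f, a, b, steps=2000):
    """Composite Simpson rule for the integral of f over [a, b]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = (b - a) / steps
    total = f(a) + f(b)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * f(a + j * h)
    return total * h / 3.0


def integrate_over_ranks(g, n, steps=2000):
    """Integrate g(x) for x in [1, n] using the substitution x = exp(u)."""
    return simpson(lambda u: g(math.exp(u)) * math.exp(u), 0.0, math.log(n), steps)


def wolman_hit_ratio(n, alpha, lam, N, mu, steps=2000):
    """Aggregate hit ratio C_N of Eq. (10) with C taken from Eq. (11)."""
    C = integrate_over_ranks(lambda x: x ** -alpha, n, steps)

    def integrand(x):
        popularity = 1.0 / (C * x ** alpha)
        renewal = 1.0 / (1.0 + mu * C * x ** alpha / (lam * N))
        return popularity * renewal

    return integrate_over_ranks(integrand, n, steps)


# Illustrative parameters only.
static_docs = wolman_hit_ratio(n=10**6, alpha=0.8, lam=100.0, N=1000, mu=0.0)
small_population = wolman_hit_ratio(n=10**6, alpha=0.8, lam=100.0, N=1000, mu=0.5)
large_population = wolman_hit_ratio(n=10**6, alpha=0.8, lam=100.0, N=100000, mu=0.5)
```

When $\mu = 0$ the renewal multiplier equals one and the hit ratio reduces to the normalization integral; for $\mu > 0$ the hit ratio grows with the population size $N$.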

In this paper we assume that the document rate of change $\mu(i)$ depends on the document's popularity index $i$. The mathematical equivalent of this assertion is that the steady-state process is again described by a Zipf-like distribution with a smaller exponent $\alpha_R$, as shown in Fig. 4.

Table 3

The renewal parameters

The difference between the ideal performance of a cache system and its real value is caused by the renewal of documents in the global network

(12)

and can be investigated experimentally. Here



$$\Delta k = \sum_{i=1}^{M} \vartheta_i - HK = k - HK + M - p, \qquad (13)$$

$\Delta k$ is the number of updating requests caused by the renewal of Web documents and

$$k_R = HK + p - M \qquad (14)$$



Therefore we have to explore the log file collected over a time $T_{st} < T_{eff} - \Delta T$, using the data from Tab. 2. Then the condition $M(T_{st}) <$ is fulfilled and all repeated requests are explained by the renewal effect. The ideal Zipf-like distribution corresponds to the upper line in Fig. 4 with exponent $\alpha$ from Eq. (4), but the real hit ratio $H$ determines $\alpha_R$ as

$$\alpha_R = 1 - \frac{2M}{HK} \qquad (15)$$
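The bookkeeping of Eqs. (13) to (15) can be sketched in Python. The sketch assumes that Eq. (15) reads $\alpha_R = 1 - 2M/(HK)$, by analogy with Eq. (4); the function name `renewal_parameters` and the input numbers are hypothetical.

```python
def renewal_parameters(H, K, k, p, M):
    """Return (delta_k, k_R, alpha, alpha_R) from measured quantities.

    H: real hit ratio, K: all requests, k: cacheable requests,
    p: unique cacheable documents, M: documents requested at least twice.
    """
    hits = H * K
    delta_k = k - hits + M - p             # Eq. (13): updating requests
    k_R = hits + p - M                     # Eq. (14)
    alpha = 1.0 - 2.0 * M / (k - p + M)    # Eq. (4): ideal exponent
    alpha_R = 1.0 - 2.0 * M / hits         # Eq. (15), as assumed above
    return delta_k, k_R, alpha, alpha_R


# Hypothetical numbers, for illustration only.
delta_k, k_R, alpha, alpha_R = renewal_parameters(H=0.3, K=1000, k=600, p=250, M=50)
```

The two counts are consistent in the sense that $k_R = k - \Delta k$, and a positive $\Delta k$ gives $\alpha_R < \alpha$.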



Now it is easy to find $\mu_i$:

(16)

or

$$\mu_i = \frac{1}{T_{st}} \, \frac{1 - (i/p)^{\Delta\alpha}}{(i/p)^{\alpha}} \qquad (17)$$



We can consider $\mu_u = \mu(p/4)$ and $\mu_p = \mu(p/100)$; the results are summarized in Table 3.
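A minimal Python sketch of the popularity-dependent rate of change follows. It assumes that Eq. (17) reads $\mu(i) = \frac{1}{T_{st}} \frac{1 - (i/p)^{\Delta\alpha}}{(i/p)^{\alpha}}$ with $\Delta\alpha = \alpha - \alpha_R$; this reading, the function name `rate_of_change`, and the values of `alpha_R` and `t_st` are assumptions made for illustration. The value of `p` is the last line number of the list fragment shown in Sec. 3.

```python
def rate_of_change(i, p, alpha, alpha_R, t_st):
    """Rate of change mu(i) for the document with popularity index i."""
    delta_alpha = alpha - alpha_R   # assumed meaning of the exponent shift
    x = i / p
    return (1.0 - x ** delta_alpha) / (t_st * x ** alpha)


# alpha = 0.77 is the fitted exponent from Sec. 4; alpha_R and t_st are hypothetical.
p = 200045
mu_u = rate_of_change(p / 4, p, alpha=0.77, alpha_R=0.70, t_st=30.0)
mu_p = rate_of_change(p / 100, p, alpha=0.77, alpha_R=0.70, t_st=30.0)
```

Under this reading the rate vanishes for the least popular document ($i = p$) and is larger for popular documents than for unpopular ones.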

The key differences between our analytical model and the model presented by Wolman et al. are:
• The document rate of change is described by a continuous variable that depends on popularity $i$. Wolman et al. assume that the document rate of change $\mu$ takes two different values, $\mu_p$ for popular documents and $\mu_u$ for unpopular documents.

• The renewal of Web documents is incorporated into the steady-state process as a Zipf-like distribution with a smaller exponent $\alpha_R$.

• We also assume that the size of the cache system is restricted.

• A considerable part of the cacheable documents, $p - M$, is requested from the global network only once, in accordance with the Zipf-like distribution. A more accurate expression for the ideal hit ratio $H_i$ was given in Ref. [5]:

$$H_i < 2^{(\alpha - 1)/\alpha} \qquad (18)$$

It has to be modified to take the renewal effect into account:

H t <2^ a (l-a)/(l-a R ), (19) 

Two types of relations lay claim to the role of fundamental laws that describe cache systems [4]:

• A Zipf-like distribution, see Eq. (1).

• Normalizing conditions, i.e. the sum of the probabilities of requesting objects from a universe of $1 < n < k$ objects.

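The laws mentioned above can be applied to special points of the Zipf distribution; two of them, $M$ and $p$, are used to construct the theory. The Zipf-like distribution then leads to

$$\vartheta_M = \frac{A k}{M^{\alpha}} = 2, \qquad (20)$$

$$\vartheta_p = \frac{A k}{p^{\alpha}} = 1. \qquad (21)$$

The normalizing conditions for the first $M$ and $p$ documents from the cache give

$$\int_1^M \frac{A}{x^{\alpha}}\, dx \approx H_i \qquad (22)$$

$$\int_1^p \frac{A}{x^{\alpha}}\, dx = 1 \qquad (23)$$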
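The normalizing conditions of Eqs. (22) and (23) can be checked against the bound of Eq. (18) with a short Python sketch. It assumes that both integrals have the integrand $A/x^{\alpha}$ and that $p = M \cdot 2^{1/\alpha}$, which follows from $\vartheta_M = 2$ and $\vartheta_p = 1$; the function names are hypothetical.

```python
def zipf_mass(upper, alpha):
    """Integral of x**(-alpha) from 1 to upper, valid for alpha != 1."""
    return (upper ** (1.0 - alpha) - 1.0) / (1.0 - alpha)


def ideal_hit_ratio(M, alpha):
    """H_i from Eq. (22), with A fixed by the normalization of Eq. (23)."""
    p = M * 2.0 ** (1.0 / alpha)
    return zipf_mass(M, alpha) / zipf_mass(p, alpha)


def ideal_hit_ratio_bound(alpha):
    """Upper bound of Eq. (18): 2**((alpha - 1) / alpha)."""
    return 2.0 ** ((alpha - 1.0) / alpha)


# M = 78166 is the line number quoted in Sec. 3; alpha = 0.77 is from Sec. 4.
h_ideal = ideal_hit_ratio(78166, 0.77)
h_bound = ideal_hit_ratio_bound(0.77)
```

The ratio of the two integrals stays below the bound and approaches it as $M$ grows, which is consistent with the strict inequality in Eq. (18).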

Here $H_i$ is the ideal (steady-state) performance for cacheable documents. For a real system, Eq. (22) is transformed to

$$H = p_c \int_1^{S_k} \frac{A}{x^{\alpha}}\, dx, \qquad (24)$$

where $S_k$ is the number of cache objects in the kernel.

It is important that the Zipf exponent $\alpha$ grows as the experiment time $T_{st}$ increases:

$$\Delta\alpha = \alpha(T_{st}^{(2)}) - \alpha(T_{st}^{(1)}) = C \ln\left( T_{st}^{(2)} / T_{st}^{(1)} \right) \qquad (25)$$
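The logarithmic growth of Eq. (25) can be used to estimate the constant $C$ from two measurements and to extrapolate $\alpha$ to another observation time. The Python sketch below is only an illustration of that arithmetic; the function names and the numerical values are hypothetical, not data from the experiment.

```python
import math


def growth_constant(alpha_1, t_1, alpha_2, t_2):
    """Constant C in alpha(t_2) - alpha(t_1) = C * ln(t_2 / t_1)."""
    return (alpha_2 - alpha_1) / math.log(t_2 / t_1)


def extrapolate_alpha(alpha_1, t_1, t, c):
    """Exponent expected after observation time t, given alpha_1 at t_1."""
    return alpha_1 + c * math.log(t / t_1)


# Hypothetical measurements: alpha = 0.70 after 7 days, 0.77 after 30 days.
c = growth_constant(0.70, 7.0, 0.77, 30.0)
alpha_60 = extrapolate_alpha(0.70, 7.0, 60.0, c)
```

Such an extrapolation is only meaningful while $\alpha$ stays below unity, as required by the Zipf-like law of Eq. (1).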



6 Summary and future work 



The aim of this paper is a study of a Web cache system in order to optimize proxy cache systems and to modernize their construction principles. Our investigations lead to criteria for the optimal use of storage capacity and allow a description of basic effects such as the ratio between the construction parts, steady-state performance, optimal size, etc. We want to emphasize that the results obtained and the plan of the experiment follow from the theoretical model.

Special consideration is given to the modification of the key formulas proposed by Wolman et al. [11]. The document rate of change is assumed to depend on the popularity index $i$, so a Zipf-like distribution with the new exponent $\alpha_R$ describes the effects of the renewal of Web documents.

The main result of any research on cache systems is finding a way to increase the hit ratio. We can conclude that the hit rate $H$ increases with the growth of $S_k/S_u$. A general feature of the current algorithm is that only documents requested two or more times during the time $t_u$ are included in the kernel $S_k$. In fact, $M$ is less than the cache size $S_{eff}$, i.e. all necessary items could be stored in the cache system at the second, third and fourth experimental points in Tab. 2. Therefore we can conclude that a caching algorithm based on a rigid tie of the mean parameters to the time $t_u$ is ineffective.

One way to resolve this situation is to reconstruct cache systems. The construction suggested in Sec. 4 must be implemented so as to provide a rigid ratio between its elements. Request statistics must be kept even for those cacheable documents that have been deleted from the cache, and these statistics should be retained for a long time. This is one of the key differences between our construction and existing ones, which usually operate only with the current set of documents in the cache. An inseparable part of the new construction is a replacement algorithm based on the Zipf law. We have prepared the corresponding application for an international patent.

An additional question, which will demand new experimental research, is the dependence that describes the renewal of Web documents. Possibly, the ratio $\Delta\alpha/\alpha$ could be considered a universal constant. The next direction in our development plans is the investigation of a model of cache interaction in a hierarchical caching system.